Final Report for ARMY STIR Grant W911NF-10-1-0360: Inferring Implicit
Human Social Network Structure from Multi-modal Data


Summary: 

This  proposal  was  a  9-month  STIR  that  explored  the  development  of  algorithms  with 
provable  guarantees  for  Markov  Random  Fields  (graphical  models)  structure  learning, 
with  applications  to  social  networks. 

Markov  Random  Fields  (MRFs),  a.k.a.  Graphical  Models,  serve  as  popular  models  for 
networks  in  the  social  and  biological  sciences,  as  well  as  communications  and  signal 
processing.  A  central  problem  is  one  of  structure  learning  or  model  selection:  given 
samples  from  the  MRF,  determine  the  graph  structure  of  the  underlying  distribution. 
When the MRF is not Gaussian (e.g. the Ising model) and contains cycles, structure
learning is known to be NP-hard even with infinite samples. Existing approaches
typically  focus  either  on  specific  parametric  classes  of  models,  or  on  the  sub-class  of 
graphs  with  bounded  degree;  the  complexity  of  many  of  these  methods  grows  quickly  in 
the  degree  bound.  We  develop  a  simple  new  ‘greedy’  algorithm  for  learning  the  structure 
of  graphical  models  of  discrete  random  variables.  It  learns  the  Markov  neighborhood  of  a 
node  by  sequentially  adding  to  it  the  node  that  produces  the  highest  reduction  in 
conditional  entropy. 

In  our  work,  we  provide  a  general  sufficient  condition  for  exact  structure  recovery  (under 
conditions  on  the  degree/girth/correlation  decay),  and  study  its  sample  and  computational 
complexity.  We  then  consider  its  implications  for  the  Ising  model,  for  which  we  establish 
a  self-contained  condition  for  exact  structure  recovery. 

Further, we present numerical results that highlight the applicability of this approach to
social network relationship learning. The results summarized in this document are
elaborated in much greater technical depth in the included technical report. An early
version of some of the results from this STIR is presented in:

P. Netrapalli, S. Banerjee, S. Sanghavi, and S. Shakkottai. Greedy learning of Markov
network structure. In 48th Annual Allerton Conference on Communication, Control and
Computing, pages 1295-1302, Sept. 29 - Oct. 1, 2010.

Outline  of  Results  in  the  Technical  Report: 

1. Algorithm: A greedy algorithm is proposed for structure learning (p. 7) that takes as
input samples from the MRF and outputs the graph structure. This is done in a sequential
and greedy manner: at each step, a node adds as a neighbor the single additional node
that most decreases its conditional entropy, conditioned on its current neighborhood.

2. Result: Under non-degeneracy, degree-bound, and correlation decay assumptions, we
show that this algorithm recovers the correct graphical model structure. We further show
that an Ising model (with some assumptions) satisfies these conditions; see Theorem 7,
p. 14 in the included technical report.


3. We study the applicability of the algorithm to the well-known senator voting records
dataset (see p. 16, and O. Banerjee et al., p. 18), and demonstrate that the algorithm
recovers our intuition on voting patterns (e.g., same-state senators tend to vote together,
same-party senators tend to vote together); however, the algorithm does so purely based on
the data and with no “side information” on political knowledge. Further, the algorithm
reveals the senators who tend to vote “across the aisle”. See the plot below, and also p. 20
of the attached technical report.

Figure 1: Following the approach in Banerjee et al., 2008, we present an application of
our algorithm to model the senator interaction graph using the senate voting records. Blue
nodes represent Democrats, red nodes represent Republicans, and a black node represents an
independent. We use a value of 0.05 for $\epsilon$ in the algorithm. We can make some
preliminary observations from the graph. Most of the Democrats are connected to other
Democrats and most of the Republicans are connected to other Republicans (in particular,
the edges between Democrats and Republicans make up approximately a 0.1 fraction of
the total number of edges). The senate minority leader, McConnell, is well connected to
other Republicans, whereas the senate majority leader, Reid, is not well connected to other
Democrats. Sanders and Lieberman, both of whom caucus with the Democrats, have more edges
to Democrats than to Republicans. We use the graph drawing algorithm of Kamada and
Kawai to render the graph (Kamada and Kawai, 1989).

Abstract

Markov Random Fields (MRFs), a.k.a. Graphical Models, serve as popular models for
networks in the social and biological sciences, as well as communications and signal
processing. A central problem is one of structure learning or model selection: given samples
from the MRF, determine the graph structure of the underlying distribution. When the
MRF is not Gaussian (e.g. the Ising model) and contains cycles, structure learning is
known to be NP-hard even with infinite samples. Existing approaches typically focus
either on specific parametric classes of models, or on the sub-class of graphs with bounded
degree; the complexity of many of these methods grows quickly in the degree bound. We
develop a simple new ‘greedy’ algorithm for learning the structure of graphical models of
discrete random variables. It learns the Markov neighborhood of a node by sequentially
adding to it the node that produces the highest reduction in conditional entropy. We
provide a general sufficient condition for exact structure recovery (under conditions on the
degree/girth/correlation decay), and study its sample and computational complexity. We
then consider its implications for the Ising model, for which we establish a self-contained
condition for exact structure recovery.

1.  Introduction 

Markov  Random  Fields  (MRF)  are  undirected  graphical  models  which  are  used  to  encode 
conditional  independence  relations  between  random  variables.  At  a  more  abstract  level, 
a  graphical  model  captures  the  dependencies  between  a  collection  of  entities.  Thus  the 
nodes  of  a  graphical  model  may  represent  people,  genes,  languages,  processes,  etc.,  while 
the  graphical  model  illustrates  certain  conditional  dependencies  among  them  (for  example, 
influence  in  a  social  network,  physiological  functionality  in  genetic  networks,  etc.).  Often 
the  knowledge  of  the  underlying  graph  is  not  available  beforehand,  but  must  be  inferred 
from certain observations of the system. In mathematical terms, these observations
correspond to samples drawn from the underlying distribution. Thus, the core task of structure
learning is that of inferring conditional dependencies between random variables from i.i.d.
samples drawn from their joint distribution. The importance of the MRF in understanding
the underlying system makes structure learning an important primitive for studying such
systems.

*. The results in this paper were presented in (Netrapalli et al., 2010) without proofs of the theorems. This
paper includes all the proofs along with simulations.

More specifically, an MRF is an undirected graph $G(V,E)$, where the vertex set
$V = \{v_1, v_2, \ldots, v_p\}$ corresponds to a $p$-dimensional random variable
$X = \{X_1, X_2, \ldots, X_p\}$ (whereby each vertex $i$ is associated with variable $X_i$),
and the edges encode the conditional dependencies between the random variables (this is
explained in detail in Section 2). A structure learning algorithm takes as input samples
drawn from the distribution of $X$, and outputs an estimate $\hat{G}$ of the underlying MRF.
There are three primary yardsticks for a structure learning algorithm: correctness, sample
complexity and computational complexity. The three are interdependent, and in a sense an
ideal structure learning algorithm is one which can learn any underlying graph on the nodes
with high probability (or with probability of error less than some given $\delta$, analogous
to the PAC model of learning) with associated sample complexity and computational
complexity polynomial in $p$ and $\frac{1}{\delta}$. However, it is known that the general
structure learning problem is a difficult problem, both in terms of sample complexity
(Santhanam and Wainwright, 2009; Bento and Montanari, 2009) and computational complexity
(Srebro, 2003; Bogdanov et al., 2008). In spite of this, the practical importance of the
problem has motivated a lot of work on this topic, and there are several approaches in the
literature that, although not optimal, perform well (both in practice, and also
theoretically) under some stronger constraints on the problem.

There are two fundamental ways to perform structure learning, corresponding to two
different interpretations of a graphical model. Under certain conditions (given by the
Hammersley-Clifford theorem (Wainwright and Jordan, 2008)), the conditional independence
view of a graphical model leads to a factorization of the joint probability mass function
(or density) according to the cliques of the graph. Parameter estimation techniques
(Ravikumar et al., 2010; Banerjee et al., 2008) utilize such a factorization of the distribution
to learn the underlying graph. These techniques assume a certain form of the potential
function, and thereby relate the structure learning problem to one of finding a sparse maximum
likelihood estimator of a distribution from its samples. On the other hand, algorithms based
on learning conditional independence relations between the variables, which we refer to as
comparison tests, are potential agnostic, i.e., they do not need knowledge of the underlying
parametrization to learn the graph. These methods are based on comparing all possible
neighborhoods of a node to find one which has the ‘maximum influence’ on the node. In
both cases, in order to learn the underlying graph accurately and efficiently, the algorithms
need some assumptions on the underlying distribution and graph structure. There are
several existing comparison test based methods (Chow and Liu, 1968; Abbeel et al., 2006;
Bresler et al., 2008; Anandkumar and Tan, 2011a,b), each with associated conditions under
which they can learn the graph correctly.

In  addition  to  the  difference  in  underlying  assumptions,  there  is  another  fundamental 
difference  in  the  philosophy  of  the  two  approaches.  The  parameter  estimation  techniques 
tend  to  be  ‘bottom-up’  approaches,  whereby  the  algorithm  is  proposed  first,  based  on  some 
intuition  regarding  the  system,  and  then  subsequently  it  is  analyzed  and  conditions  are 
found  for  correctness  and  efficiency.  On  the  other  hand,  the  comparison-test  techniques  in 
literature  tend  to  be  designed  with  the  aim  of  achieving  some  correctness  requirements.  As  a 
result,  comparison-test  algorithms  usually  involve  a  computationally  expensive  search  over 
all  potential  neighborhoods  of  a  node,  and  this  increases  their  computational  complexity. 
In addition, although these algorithms make no assumptions on the parametrization of the
distribution, they need to assume some properties of the graph in order to succeed (for
example, the algorithm of Bresler et al. (Bresler et al., 2008) needs to know the maximum
degree of the graph in order to learn it). Our contribution in this work is to propose a
simple ‘greedy’, comparison-test based algorithm for learning MRF structure. As in any
sub-optimal greedy algorithm, we cannot always guarantee correctness, but are guaranteed
low computational complexity. However, we are able to provide general sufficient conditions
for the success of the algorithm for any graphical model, and show that these conditions are
in fact satisfied by one specific graphical model of significance in the literature: the pairwise
symmetric binary model, or the Ising model.

Greedy  comparison-tests  for  exact  structure  learning  are  however  not  completely  new, 
and  in  fact  one  of  the  early  successes  in  the  field  was  in  the  form  of  a  greedy  algorithm.  In 
their  seminal  paper,  Chow  and  Liu  (Chow  and  Liu,  1968)  showed  that  if  the  MRF  was  a 
tree,  then  it  could  be  learnt  by  a  simple  maximum  spanning  tree  algorithm.  However  their 
method  is  crucially  dependent  on  the  underlying  graph  being  a  spanning  tree  (although 
recent  results  (Tan  et  al.,  2010)  have  shown  how  it  can  be  modified  to  learn  general  acyclic 
graphs),  and  fails  as  soon  as  the  graph  has  loops.  Our  algorithm,  in  some  sense,  generalizes 
the  Chow  and  Liu  algorithm  to  a  richer  class  of  graphs.  This  is  in  spirit  similar  to  the 
manner  in  which  loopy  belief  propagation  extends  the  dynamic  programming  paradigm 
from  trees  to  loopy  graphs.  One  notes  however  that  unlike  the  Chow  and  Liu  algorithm 
which  searches  for  a  globally  optimal  graph,  ours  is  a  locally  greedy  algorithm,  whereby  we 
learn  the  neighborhood  of  each  node  separately  in  a  greedy  manner. 

The  remaining  sections  are  organized  as  follows.  In  Section  2,  we  review  graphical  models 
and  some  results  from  information  theory,  and  set  up  the  structure  learning  problem.  Our 
new structure learning algorithm, GreedyAlgorithm($\epsilon$), is given in Section 3. Next, in
Section  4,  we  develop  a  sufficient  condition  for  the  correctness  of  the  algorithm  for  general 
graphs.  To  demonstrate  the  applicability  of  this  condition,  we  translate  it  into  equivalent 
conditions  for  learning  an  Ising  model  in  Section  5.  We  present  simulation  results  evaluating 
our  algorithm  in  Section  6.  We  discuss  future  work  and  conclude  in  Section  7.  The  proofs 
of  theorems  are  in  the  Appendix. 

2.  Preliminaries 

In this section, we formally define a graphical model and set up the structure learning
problem. In addition, as a precursor to our structure learning algorithm, we define
conditional  entropy,  and  state  some  of  its  properties  which  we  use  later.  We  also  define 
a  notion  of  ‘empirical’  conditional  entropy  which  we  later  use  as  our  test  function,  and 
state  an  important  lemma  from  information  theory  that  helps  relate  empirical  entropy  and 
empirical  measures.  For  more  details  regarding  graphical  models,  refer  to  (Wainwright  and 
Jordan,  2008),  and  for  the  information  theoretic  definitions,  refer  to  (Cover  and  Thomas, 
2006). 

First we establish some notation that we use throughout. We assume in this paper
that the random vector $X$ whose graph we are trying to learn is discrete valued. More
specifically, we assume that $X$ is a $p$-dimensional random vector $\{X_1, X_2, \ldots, X_p\}$,
where each component $X_i$ of $X$ takes values in a finite set $\mathcal{X}$. We use the
shorthand notation $P(x_i)$ to stand for $P(X_i = x_i)$, $x_i \in \mathcal{X}$, and similarly
for a set $A \subseteq \{1, 2, \ldots, p\}$, we define $P(x_A) = P(X_A = x_A)$,
$x_A \in \mathcal{X}^{|A|}$, where $X_A = \{X_i \mid i \in A\}$.

2.1  Graphical  Models  and  Structure  Learning 

As mentioned before, an undirected graphical model corresponding to a probability
distribution is specified by an undirected graph $G = (V, E)$, with each vertex $v_i \in V$
corresponding to a random variable $X_i$ which is a component of a $p$-dimensional random
vector $X$ (for ease of notation, henceforth when we mention a node, we refer to both the
physical node in the graph and the associated random variable; the exact meaning should be
clear from the context). The edges $E \subseteq V \times V$ of a graphical model can be viewed
as encoding the probability distribution of $X$ in several ways, all of which are equivalent
under certain conditions. For the purposes of structure learning, an important interpretation
is the local Markov property, stated below.

Definition 1 (Local Markov) Given $G(V,E)$, let $N(i) = \{j \in V \mid (i,j) \in E\}$ denote the
neighborhood of node $i$. Then a random vector $X$ is said to obey the local Markov property
with respect to the graph $G$ if for every $i \in V$, conditioned on the nodes in the neighborhood
of $i$, the node $i$ is independent of the remaining nodes in the graph. Mathematically, this
means that for any set $B \subseteq V \setminus (\{i\} \cup N(i))$, we have that
$P(x_i \mid x_{N(i)}, x_B) = P(x_i \mid x_{N(i)})$ for all
$(x_i, x_{N(i)}, x_B) \in \mathcal{X}^{1 + |N(i)| + |B|}$. We henceforth write this as
$X_i \perp\!\!\!\perp X_{V \setminus (\{i\} \cup N(i))} \mid X_{N(i)}$.

Finally, the structure learning problem is stated formally as follows: given $n$ i.i.d.
samples drawn from a random variable $X$ with MRF $G$, give a learning algorithm and associated
conditions such that the hypothesis of the algorithm, $\hat{G}$, is equal to the true MRF $G$ with
probability greater than $1 - \delta$.

2.2  Factor  Graphs 

Every  graphical  model  has  a  factor  graph  representation  defined  as  follows. 

Definition 2 (Factor Graph) Given a graphical model $G(V,E)$, its factor graph is a
bipartite graph $G_f$ with vertex set $V \cup C$ where each vertex $c \in C$ corresponds to a
maximal clique in $G$. For any $v \in V$ and $c \in C$, there is an edge $\{v,c\}$ in $G_f$ if and
only if $v \in c$ in $G$.

We have the following simple lemma relating the distance between two nodes $i,j \in V$ in
the graphs $G$ and $G_f$.

Lemma 1 Given a graph $G$, let $G_f$ be its factor graph. Then for every $i,j \in V$ we have
$d_f(i,j) = 2\,d(i,j)$ where $d$ and $d_f$ are the distances between $i$ and $j$ in $G$ and $G_f$ respectively.
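
As an illustration of Definition 2 and Lemma 1, the following minimal sketch builds the factor graph of a small graph and compares distances in the two graphs. The function names (`maximal_cliques`, `factor_graph`, `distance`) and the example graph are assumptions made for this illustration only, and the brute-force clique enumeration is meant only for very small graphs.

```python
from collections import deque
from itertools import combinations


def maximal_cliques(adj):
    """Brute-force maximal cliques of a tiny graph given as {node: set(neighbors)}."""
    nodes = sorted(adj)
    cliques = [
        frozenset(s)
        for r in range(1, len(nodes) + 1)
        for s in combinations(nodes, r)
        if all(b in adj[a] for a, b in combinations(s, 2))
    ]
    # Keep only cliques that are not strictly contained in another clique.
    return [c for c in cliques if not any(c < other for other in cliques)]


def factor_graph(adj):
    """Bipartite graph on V and C: variable v is joined to clique c iff v is in c."""
    fg = {v: set() for v in adj}
    for c in maximal_cliques(adj):
        fg[c] = set(c)
        for v in c:
            fg[v].add(c)
    return fg


def distance(graph, src, dst):
    """Shortest-path distance by breadth-first search (None if disconnected)."""
    seen, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return seen[u]
        for w in graph[u]:
            if w not in seen:
                seen[w] = seen[u] + 1
                queue.append(w)
    return None


# Hypothetical example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
example_fg = factor_graph(example)
```

On this example, each pair of variable nodes is twice as far apart in `example_fg` as in `example`, because every step in $G$ becomes a variable-clique-variable step in $G_f$.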

2.3  Conditional  Entropy  Tests 

As we described before in the introduction, a comparison-test based method of structure
learning is based on using a test function to compare candidate graphs. Although there are
several different implementations, they are all based on the local Markov interpretation of
the graph. More specifically, most comparison-test algorithms try to learn the neighborhood
of each individual node by comparing potential neighborhoods using a test function.
Following the approach of Abbeel et al. (Abbeel et al., 2006), we use conditional entropies
as our test function for selecting nodes. In this section, we provide the necessary definitions,
and also state some results from information theory that underlie our approach.

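First we need to define a few quantities which we use throughout this paper. Given a
discrete-valued random variable $Y$ taking values in a finite set $\mathcal{Y}$ such that
$P(Y = y) = p_y > 0\ \forall y \in \mathcal{Y}$, and given $n$ i.i.d. samples
$\{Y^{(i)}\}_{i=1}^{n}$, the empirical probability mass function $\hat{P}(y), y \in \mathcal{Y}$
is defined as

$$\hat{P}(y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{Y^{(i)} = y\}}, \quad \forall y \in \mathcal{Y}.$$

The empirical entropy $\hat{H}(Y)$ is defined as the entropy of the empirical distribution $\hat{P}$.

Next, given two variables $Y_1, Y_2$, both taking values in $\mathcal{Y}$, we can extend this
notation to define empirical conditional measures of the form

$$\hat{P}(y_1 \mid y_2) = \frac{\sum_{i=1}^{n} \mathbb{1}_{\{Y_1^{(i)} = y_1,\, Y_2^{(i)} = y_2\}}}{\sum_{i=1}^{n} \mathbb{1}_{\{Y_2^{(i)} = y_2\}}}, \quad \forall (y_1, y_2) \in \mathcal{Y}^2.$$

Finally, for fixed $y_2 \in \mathcal{Y}$ we define the empirical conditional entropy

$$\hat{H}(Y_1 \mid Y_2 = y_2) = -\sum_{y_1 \in \mathcal{Y}} \hat{P}(y_1 \mid y_2) \log \hat{P}(y_1 \mid y_2),$$

and using this we define

$$\hat{H}(Y_1 \mid Y_2) = \sum_{y_2 \in \mathcal{Y}} \hat{P}(y_2)\, \hat{H}(Y_1 \mid Y_2 = y_2).$$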

Given  samples,  we  use  the  empirical  conditional  entropies  as  given  above  as  the  proxy  for 
the  actual  conditional  entropy.  Note  also  that  we  can  define  set  based  versions  of  all  the 
above  statements  in  a  similar  manner. 
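
The empirical conditional entropy defined above can be computed directly from sample counts. The sketch below is one possible plug-in implementation; the function name `empirical_conditional_entropy` and the choice of base-2 logarithms (bits) are assumptions made for this illustration, since the text does not fix the base of the logarithm.

```python
from collections import Counter
import math


def empirical_conditional_entropy(y1, y2):
    """Plug-in estimate of H(Y1 | Y2) in bits from two equal-length sample lists."""
    n = len(y1)
    joint = Counter(zip(y1, y2))  # counts of each observed pair (y1, y2)
    marg = Counter(y2)            # counts of each observed value of y2
    h = 0.0
    for (_, b), c in joint.items():
        # P_hat(b) * P_hat(a | b) * log P_hat(a | b) simplifies to
        # (c / n) * log(c / marg[b]), where c counts the pair (a, b).
        h -= (c / n) * math.log2(c / marg[b])
    return h


# Hypothetical usage: Y1 is a copy of Y2 in the first call, unrelated in the second.
h_copy = empirical_conditional_entropy([0, 0, 1, 1], [0, 0, 1, 1])
h_indep = empirical_conditional_entropy([0, 1, 0, 1], [0, 0, 1, 1])
```

Only pairs that actually occur in the samples contribute, so no special handling of $0 \log 0$ terms is needed.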

The  use  of  conditional  entropies  as  a  test  function  is  motivated  by  two  reasons: 

1.  By  the  local  Markov  property,  the  conditional  entropy  for  a  node  is  minimized  by  sets 
which  contain  the  true  neighborhood,  and  hence  (under  some  weak  non-degeneracy 
conditions),  the  smallest  cardinality  set  which  minimizes  the  conditional  entropy  is 
the  true  neighborhood. 

2.  Entropy  and  measure  are  related  in  the  sense  that  two  probability  measures  on  a  set 
are  close  if  their  entropies  are  close  and  vice  versa. 

The  first  point  is  the  main  reason  behind  using  conditional  entropies  as  a  test  function,  as  it 
reduces  the  problem  of  finding  a  neighborhood  to  that  of  finding  a  set  which  minimizes  an 
appropriate  function,  and  also  indicates  a  natural  greedy  sequential  approach  to  selecting 
the  neighbors.  We  encode  this  notion  in  the  following  proposition,  which  can  be  easily 
derived  from  the  Data  Processing  Inequality,  see  (Cover  and  Thomas,  2006). 

Proposition 1 For any node $i \in V$, we have that

$$H(X_i \mid X_{N(i)}) \le H(X_i \mid X_A),$$

for any set $A \subseteq V \setminus \{i\}$.

The  second  point  can  be  thought  of  as  indicating  that  no  information  is  lost  if  we  use 
entropies  instead  of  measures  to  learn  the  structure.  This  notion  can  be  quantified  in  terms 
of  the  following  proposition,  which  we  get  by  combining  Theorem  16.3.2  and  Lemma  16.3.1 
from  (Cover  and  Thomas,  2006). 

Proposition 2 Let $P$ and $Q$ be two probability mass functions on a finite set $\mathcal{X}$, with
entropies $H(P)$ and $H(Q)$ respectively, and with total variational distance $\|P - Q\|_1$ given
by:

$$\|P - Q\|_1 = \sum_{x \in \mathcal{X}} |P(x) - Q(x)|.$$

Then

$$|H(P) - H(Q)| \le -\|P - Q\|_1 \log \frac{\|P - Q\|_1}{|\mathcal{X}|}. \tag{1}$$

Further, if the relative entropy between them is given by $D(P\|Q)$, then

$$D(P\|Q) \ge \frac{1}{2 \ln 2} \|P - Q\|_1^2. \tag{2}$$

We  use  this  proposition  in  several  places  in  subsequent  proofs.  At  a  high  level,  (1) 
allows  us  to  leverage  results  of  convergence  of  empirical  measures  to  the  true  measure  to 
obtain  similar  guarantees  on  the  empirical  entropy,  while  (2)  is  used  to  convert  entropy 
conditions  to  equivalent  conditions  on  the  measure  (in  particular,  this  allows  us  to  state 
our  non-degeneracy  conditions  directly  in  terms  of  the  conditional  entropy,  instead  of  more 
complicated  statements  in  terms  of  probability  distributions  usually  found  in  literature 
(Bresler et al., 2008)).
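
The two bounds in Proposition 2 can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the distributions `P` and `Q` are made up for the example, logarithms are taken in base 2 (as in Cover and Thomas), and the helper names are hypothetical. Note that the textbook statement of bound (1) assumes $\|P - Q\|_1 \le 1/2$, which holds for this example.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a pmf given as a list of probabilities."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def l1_distance(p, q):
    """Sum of absolute differences between two pmfs on the same alphabet."""
    return sum(abs(a - b) for a, b in zip(p, q))


def relative_entropy(p, q):
    """D(P || Q) in bits; assumes q > 0 wherever p > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


# Hypothetical pmfs on a three-element alphabet.
P = [0.5, 0.3, 0.2]
Q = [0.45, 0.35, 0.2]

d = l1_distance(P, Q)
entropy_gap = abs(entropy(P) - entropy(Q))
entropy_bound = -d * math.log2(d / len(P))   # right-hand side of (1)
divergence = relative_entropy(P, Q)
divergence_bound = d ** 2 / (2 * math.log(2))  # right-hand side of (2)
```

Here `entropy_gap` stays below `entropy_bound`, and `divergence` stays above `divergence_bound`, in line with (1) and (2).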

3. The GreedyAlgorithm($\epsilon$) Structure Learning Algorithm

In this section, we present our greedy structure learning algorithm, which we henceforth
refer to as GreedyAlgorithm($\epsilon$). We also argue that it always has a low worst-case
computational complexity, owing to its greedy nature. The challenge however is to find
conditions that guarantee correctness, and this question is addressed in subsequent sections.

At  a  high  level,  our  algorithm  considers  each  node  separately,  and  adds  nodes  to  its 
neighborhood  sequentially  in  a  greedy  manner.  In  particular,  at  each  step  we  find  the  node 
that  provides  the  highest  reduction  in  conditional  entropy  when  added  to  the  existing  set. 
We stop when this reduction falls below a threshold determined by $\epsilon$.

More specifically, GreedyAlgorithm($\epsilon$) takes as input the $n$ samples and a single
‘threshold’ value $\epsilon$. Given any node $i$, the candidate neighborhood $\hat{N}(i)$ of the
node is initially set to $\emptyset$ and is learnt in a sequential manner. In the first stage, the
node $j \neq i$ which minimizes the conditional entropy $\hat{H}(X_i \mid X_j)$ is chosen as a
candidate neighbor, and is added to $\hat{N}(i)$ if conditioning on the node $j$ reduces the
entropy by at least $\epsilon/2$. In any subsequent stage, a candidate node
$k \in V \setminus \hat{N}(i)$ is chosen as one which minimizes
$\hat{H}(X_i \mid X_{\hat{N}(i)}, X_k)$, and is added if it reduces the conditional entropy by at
least $\epsilon/2$. At any stage when this condition is not satisfied, the algorithm outputs
$\hat{N}(i)$ and moves on to the next node.

GreedyAlgorithm($\epsilon$) for structure learning is formally presented in Algorithm 1.

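Algorithm 1 GreedyAlgorithm($\epsilon$)

 1: for $i \in V$ do
 2:   complete $\leftarrow$ FALSE
 3:   $\hat{N}(i) \leftarrow \emptyset$
 4:   while !complete do
 5:     $j = \arg\min_{k \in V \setminus \hat{N}(i)} \hat{H}(X_i \mid X_{\hat{N}(i)}, X_k)$
 6:     if $\hat{H}(X_i \mid X_{\hat{N}(i)}, X_j) < \hat{H}(X_i \mid X_{\hat{N}(i)}) - \frac{\epsilon}{2}$ then
 7:       $\hat{N}(i) \leftarrow \hat{N}(i) \cup \{j\}$
 8:     else
 9:       complete $\leftarrow$ TRUE
10:     end if
11:   end while
12: end for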
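
Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: samples are assumed to be a list of equal-length tuples of discrete values, entropies are measured in bits, the names `cond_entropy` and `greedy_neighborhoods` are hypothetical, and node $i$ itself is explicitly excluded from the candidate set (which the pseudocode leaves implicit).

```python
from collections import Counter
import math


def cond_entropy(samples, i, cond):
    """Empirical H(X_i | X_cond) in bits; cond is a list of variable indices."""
    n = len(samples)
    joint = Counter((s[i], tuple(s[k] for k in cond)) for s in samples)
    marg = Counter(tuple(s[k] for k in cond) for s in samples)
    return -sum((c / n) * math.log2(c / marg[b]) for (_, b), c in joint.items())


def greedy_neighborhoods(samples, eps):
    """Return {i: estimated neighborhood of i}, following the steps of Algorithm 1."""
    p = len(samples[0])
    result = {}
    for i in range(p):                                     # line 1
        nbhd = []                                          # line 3
        current = cond_entropy(samples, i, nbhd)
        while True:                                        # line 4
            candidates = [k for k in range(p) if k != i and k not in nbhd]
            if not candidates:
                break
            # line 5: candidate giving the smallest conditional entropy
            j = min(candidates, key=lambda k: cond_entropy(samples, i, nbhd + [k]))
            h_j = cond_entropy(samples, i, nbhd + [j])
            if h_j < current - eps / 2:                    # line 6
                nbhd.append(j)                             # line 7
                current = h_j
            else:
                break                                      # line 9
        result[i] = set(nbhd)
    return result


# Hypothetical toy data: X1 is a copy of X0, and X2 is independent of both.
toy_samples = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
toy_graph = greedy_neighborhoods(toy_samples, eps=0.1)
```

On the toy data, nodes 0 and 1 select each other and node 2 selects no neighbors, since conditioning on either other variable leaves its empirical entropy unchanged.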


Since the algorithm is greedy, we can characterize its worst-case computational
complexity independent of its correctness guarantees.

Proposition 3 The running time of Algorithm 1 is $O(np^4)$ where $n$ is the number of
samples and $p$ is the number of random variables.

Proof The outer for loop is executed $O(p)$ times. For every iteration of the outer for
loop, the while loop (lines 4-11) is run $O(p)$ times. In every iteration of the while loop, line
5 calculates the empirical entropy of node $i$ conditioned on $\hat{N}(i)$ together with each of
the $O(p)$ candidate nodes outside $\hat{N}(i)$. Thus, in the worst case, the algorithm performs
$O(p^3)$ comparison tests (empirical conditional entropy calculations from samples). Even
assuming a naive implementation of a single comparison test that takes $O(np)$, the overall
time taken by the algorithm is $O(np^4)$. ■


This shows that GreedyAlgorithm($\epsilon$) always has low computational complexity for any
graph (and in particular, in Section 4, we show that for a large class of graphs, the algorithm
has running time of $O(np^2)$). The tradeoff is however in correctness guarantees. The
problem  arises  in  the  fact  that  unlike  other  comparison-test  algorithms  which  are  designed 
to  ensure  certain  correctness  guarantees,  our  algorithm  is  designed  more  from  the  point  of 
view  of  simplicity  and  low  computational  costs.  Therefore  to  derive  theoretical  guarantees 
for  the  algorithm,  it  is  first  important  to  understand  the  failure  mechanism  of  the  algorithm. 

4.  Sufficient  Conditions  for  General  Discrete  Graphical  Models 

In this section, we provide conditions for general discrete graphical models under which
GreedyAlgorithm($\epsilon$) recovers the graphical model structure exactly. First, using an
example, we build up intuition for the sufficient conditions, and define two key notions:
non-degeneracy conditions and correlation decay. Our main result is presented in Section 4.2,
wherein we give a sufficient condition for the correctness of the algorithm in general discrete
graphical models.

4.1  Non-Degeneracy  and  Correlation  Decay 

Before analyzing the correctness of structure learning from samples, a simpler problem worth
considering is one of algorithm consistency, i.e., does the algorithm succeed in identifying the
true graph given the true conditional distributions (or in other words, given an infinite
number of samples)? It turns out that the algorithm as presented in Algorithm 1 does not
even possess this property, as is illustrated by the following counter-example.

Let $V = \{0, 1, \ldots, D, D+1\}$, $X_i \in \{-1, 1\}\ \forall i \in V$ and
$E = \{\{0, i\}, \{i, D+1\} \mid 1 \le i \le D\}$. Let
$P(x_V) = \frac{1}{Z} \prod_{\{i,j\} \in E} e^{\theta x_i x_j}$, where $Z$ is a normalizing constant
(this is the classical zero-field Ising model potential). The graph is shown in Fig. 1.


Figure 1: An example of adding spurious nodes: Execution of GreedyAlgorithm($\epsilon$) for node
0 adds node $D + 1$ in the first iteration, even though it is not a neighbor.

Suppose the actual entropies are given as input to Algorithm 1. It can be shown in
this case that for a given $\theta$, there exists a $D_{\text{thresh}}$ such that if
$D > D_{\text{thresh}}$, then Algorithm 1 will select the edge $\{0, D+1\}$ in the first
iteration. This is easily understood because if $D$ is large, the distribution of node 0 is best
accounted for by node $D+1$, although it is not a neighbor. Thus, even with exact entropies,
the algorithm will always include edge $\{0, D+1\}$, although it does not exist in the graph.
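
The counter-example can be reproduced by exact enumeration for small $D$. The sketch below is an illustration under stated assumptions: the helper names (`ising_joint`, `cond_entropy`, `first_pick`) and the parameter values $\theta = 0.5$, $D \in \{1, 5\}$ are chosen for this example only, and entropies are in bits. It computes the exact joint distribution, then reports which node the first greedy step would pick for node 0 when given exact entropies.

```python
import itertools
import math


def ising_joint(D, theta):
    """Exact joint pmf of the zero-field Ising model on the counter-example graph."""
    # Node 0 and node D+1 are each joined to the middle nodes 1..D.
    edges = [(0, i) for i in range(1, D + 1)] + [(i, D + 1) for i in range(1, D + 1)]
    weights = {
        x: math.exp(theta * sum(x[a] * x[b] for a, b in edges))
        for x in itertools.product((-1, 1), repeat=D + 2)
    }
    Z = sum(weights.values())  # normalizing constant
    return {x: w / Z for x, w in weights.items()}


def cond_entropy(joint, target, given):
    """Exact H(X_target | X_given) in bits from a joint pmf over full configurations."""
    pair, marg = {}, {}
    for x, prob in joint.items():
        g = tuple(x[k] for k in given)
        pair[(x[target], g)] = pair.get((x[target], g), 0.0) + prob
        marg[g] = marg.get(g, 0.0) + prob
    return -sum(pr * math.log2(pr / marg[g]) for (_, g), pr in pair.items())


def first_pick(D, theta):
    """Node chosen in the first greedy iteration for node 0, using exact entropies."""
    joint = ising_joint(D, theta)
    return min(range(1, D + 2), key=lambda k: cond_entropy(joint, 0, [k]))


pick_small = first_pick(D=1, theta=0.5)  # the graph is the chain 0 - 1 - 2
pick_large = first_pick(D=5, theta=0.5)  # five parallel two-hop paths from 0 to 6
```

With these values, the first pick for $D = 1$ is the true neighbor (node 1), while for $D = 5$ it is the non-neighbor $D + 1 = 6$, matching the behavior described above.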

The algorithm can however easily be shown to satisfy the following weaker consistency
guarantee: given infinite samples, for any node $i$ in the graph, the algorithm will return a
super-neighborhood, i.e., a superset of the neighborhood of $i$. This suggests a simple fix to
obtain a consistent algorithm, as we can follow the greedy phase by a ‘node-pruning’ phase,
wherein we test each node in the neighborhood of a node $i$ returned by the algorithm (to do
this, we can compare the entropy of $i$ conditioned on the neighborhood with and without
a node, and remove it if they are the same). However the problem is complicated when only
finitely many samples are available, as pruning a large super-neighborhood requires calculating
estimates of entropy conditioned on a large number of nodes, and hence this drives up the
sample complexity. In the rest of the paper, we avoid this problem by ignoring the pruning
step, and instead prove a stronger correctness guarantee: given any node $i$, the algorithm
always picks a correct neighbor of $i$ as long as any one remains undiscovered. Towards this
end, we first define two conditions which we require for the correctness of
GreedyAlgorithm($\epsilon$).

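Assumption 1 (Non-degeneracy) Choose a node $i$. Let $N(i)$ be the set of its neighbors.
Then $\exists\, \epsilon > 0$ such that $\forall\, A \subset N(i)$, $\forall\, j \in N(i) \setminus A$ and $\forall\, l \in N(j) \setminus \{i\}$, we have that

$$H(X_i \mid X_A) - H(X_i \mid X_A, X_j) > \epsilon \quad \text{and} \qquad (3)$$

$$H(X_i \mid X_A, X_l) - H(X_i \mid X_A, X_j, X_l) > \epsilon. \qquad (4)$$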

Assumption  1  is  illustrated  in  Fig.  2. 


Figure 2: Non-degeneracy condition for node $i$: (i) the entropy of $i$ conditioned on any
sub-neighborhood $A$ reduces by at least $\epsilon$ if any other neighbor $j$ is added to the
conditioning set; (ii) the entropy of $i$ conditioned on $A$ and a two-hop neighbor $l$
reduces by at least $\epsilon$ if the corresponding one-hop neighbor $j$ is added to the
conditioning set.


Assumption 2 (Correlation Decay) Choose a node $i$. Let $N^1(i)$ and $N^2(i)$ be the sets
of its 1-hop and 2-hop neighbors respectively. Choose another set of nodes $B$. Let
$d(i, B) = \min_{j \in B} d(i, j)$, where $d(i, j)$ denotes the distance between nodes $i$ and $j$.
Then, we have that $\forall\, x_i, x_{N^1(i)}, x_{N^2(i)}, x_B$

$$\left| P(x_i, x_{N^1(i)}, x_{N^2(i)} \mid x_B) - P(x_i, x_{N^1(i)}, x_{N^2(i)}) \right| < f(d(i, B))$$

where $f$ is a monotonic decreasing function.

Assumption  1  (or  a  variant  thereof)  is  a  standard  assumption  for  showing  correctness 
of  any  structure  learning  algorithm,  as  it  ensures  that  there  is  a  unique  minimal  graphical 
model  for  the  distribution  from  which  the  samples  are  generated.  Although  the  way  we 
state the assumption is tailored to our algorithm, it can be shown to be equivalent to
similar assumptions in the literature (Bresler et al., 2008). Informally speaking, Assumption 1
states that for node $i$, any 2-hop neighbor captures less information about node $i$ than the
corresponding 1-hop neighbor. In the case of a Markov chain, Assumption 1 reduces to a
weaker version of an $\epsilon$-Data Processing Inequality (i.e., DPI with an epsilon gap), and in
a sense, Assumption 1 can be viewed as a generalized $\epsilon$-DPI for networks with cycles.

On  the  other  hand,  Assumption  2  along  with  large  girth  implies  that  the  information  a 
node  j  has  about  node  i  is  ‘almost  Markov’  along  the  shortest  path  between  i  and  j.  This 
in  conjunction  with  Assumption  1  implies  that  for  any  two  nodes  i  and  k,  the  information 
about  i  captured  by  k  is  less  than  that  captured  by  j  where  j  is  the  neighbor  of  i  on  the 
shortest  path  between  i  and  k. 


4.2  Guarantees  for  the  Recovery  of  a  General  Graphical  Model 

We now state our main theorem, wherein we give a sufficient condition for correctness of
GreedyAlgorithm($\epsilon$) in a general graphical model.

The  counter-example  given  in  Section  4.1  suggests  that  the  addition  of  spurious  nodes 
to  the  neighborhood  of  i  is  related  to  the  existence  of  non-neighboring  nodes  of  i  which 
somehow accumulate sufficient influence over it. The accumulation of influence is due to
slow decay of influence on short paths (corresponding to a high $\theta$ in the example), and
the effect of a large number of such paths (corresponding to high $D$). Correlation decay
(Assumption  2)  allows  us  to  control  the  first.  Intuitively,  the  second  can  be  controlled  if 
the  neighborhood  of  i  is  ‘locally  tree-like’.  To  quantify  this  notion,  we  define  the  girth  of 
a  graph  Girth(G)  to  be  the  length  of  the  smallest  cycle  in  the  graph  G.  Now  we  have  the 
following  theorem. 


Theorem 2 Consider a graphical model $G$ where the random variable corresponding to
each node takes values in a set $\mathcal{X}$ and satisfies the following:

• Non-degeneracy (Assumption 1) with parameter $\epsilon$,

• Correlation decay (Assumption 2) with decay function $f(\cdot)$,

• Maximum degree $D$.

Define $h = h(\epsilon, D) = c \ldots$ and suppose $f^{-1}(h)$ exists. Further suppose $G_f$ (the
factor graph of $G$) obeys the following condition:

$$\mathrm{Girth}(G_f) \triangleq g_f > 4\left(f^{-1}(h) + 1\right). \qquad (5)$$

Then, given $\delta > 0$, GreedyAlgorithm($\epsilon$) recovers $G$ exactly with probability greater than $1 - \delta$
with sample complexity $n = \xi\left(\epsilon^{-4} \log\frac{p}{\delta}\right)$, where $\xi$ is a constant independent of $p$, $\epsilon$ and $\delta$.


The  proof  follows  from  the  following  two  lemmas.  Lemma  3  implies  that  if  we  had  access  to 
actual  entropies,  Algorithm  1  always  recovers  the  neighborhood  of  a  node  exactly.  Lemma 
4  shows  that  with  the  number  of  samples  n  as  stated  in  Theorem  2,  the  empirical  entropies 
are  very  close  to  the  actual  entropies  with  high  probability  and  hence  Algorithm  1  recovers 
the  graphical  model  structure  exactly  with  high  probability  even  with  empirical  entropies. 
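
To make the role of the empirical entropies concrete, the following is a minimal Python sketch of a greedy neighborhood search of this kind. It is an illustration under stated assumptions, not the authors' implementation: the function names `cond_entropy`, `greedy_neighborhood` and `sample_chain` are hypothetical, samples are assumed to be tuples of discrete values, and the stopping threshold $\epsilon/2$ is an assumed choice.

```python
import math
import random
from collections import Counter

def cond_entropy(samples, i, cond):
    """Empirical H(X_i | X_cond) in nats, computed from a list of sample tuples."""
    n = len(samples)
    joint = Counter((tuple(s[c] for c in cond), s[i]) for s in samples)
    marg = Counter(tuple(s[c] for c in cond) for s in samples)
    return -sum(cnt / n * math.log(cnt / marg[key])
                for (key, _), cnt in joint.items())

def greedy_neighborhood(samples, i, eps):
    """Repeatedly add the node giving the largest drop in conditional entropy.

    Stops when the best drop is below eps / 2 (assumed threshold).
    """
    p = len(samples[0])
    nbrs = []
    while True:
        base = cond_entropy(samples, i, nbrs)
        candidates = [j for j in range(p) if j != i and j not in nbrs]
        if not candidates:
            return nbrs
        best = min(candidates, key=lambda j: cond_entropy(samples, i, nbrs + [j]))
        if base - cond_entropy(samples, i, nbrs + [best]) < eps / 2:
            return nbrs
        nbrs.append(best)

def sample_chain(n, flip=0.1, seed=0):
    """Toy data: a binary Markov chain X0 - X1 - X2 with flip probability `flip`."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        x0 = rng.choice([-1, 1])
        x1 = x0 if rng.random() > flip else -x0
        x2 = x1 if rng.random() > flip else -x1
        out.append((x0, x1, x2))
    return out

samples = sample_chain(5000)
nbrs_of_0 = greedy_neighborhood(samples, 0, eps=0.1)
nbrs_of_1 = sorted(greedy_neighborhood(samples, 1, eps=0.1))
print(nbrs_of_0, nbrs_of_1)
```

On this toy chain the end node should recover its single neighbor and the middle node both of its neighbors, since the entropy reductions from true neighbors are far larger than the threshold while the spurious reduction from node 2 (given node 1) is only sampling noise.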


Lemma 3 Consider a graphical model $G$ in which node $i$ satisfies Assumptions 1 and 2.
Let the girth of $G_f$ be $g_f > 4\left(f^{-1}(h) + 1\right)$, where $h$ is as defined in Theorem 2. Then,
$\forall\, A \subset N(i)$, $u \notin N(i)$, $\exists\, j \in N(i) \setminus A$ such that

H(Xi  |  XA,  Xj)  <  H(Xi  |  XA,  Xu)  -  j 
(6)


Proof If $A$ separates $i$ and $u$ in $G_f$, it also does so in $G$. Then we have that $P(x_i \mid x_A, x_u) =
P(x_i \mid x_A)$ and hence $H(X_i \mid X_A, X_u) = H(X_i \mid X_A)$. Then, the statement of the lemma
follows from (3).

Now suppose $A$ does not separate $i$ and $u$ in $G_f$. Consider the shortest path between $i$
and $u$ in $G_f \setminus A$. Let $j \in N(i) \setminus A$ and $l \in N(j) \setminus \{i\}$ be on that shortest path. Assumption
1 implies that $H(X_i \mid X_A, X_l) - H(X_i \mid X_A, X_j, X_l) > \epsilon$. Now, choose $B \subset V$ such that
AUBU  {j}  separates  i  and  l  in  Gf  and  df(i,B)  >  where  gf  is  the  girth  of  Gf. 

Note  that  such  a  B  (possibly  empty)  exists  since  the  girth  of  Gf  is  gf  and  if  a  node  in  the 
separator  is  a  factor  node  (i.e. ,  not  in  V)  then  we  can  replace  it  by  all  its  neighbors  (in  V). 
We  then  see  using  Lemma  1  that  d(i,  B)  >  9f4  From  Assumption  2,  we  know  that 

|P(xj,  XjY(j)uAr2(i))  ^(*^0  *^'Ar(i)UAf2(i)  I  ^  /  (“47 

^  ^2  \P(Xi,XA,Xj)  ~  P(Xi,XA,Xj  I  xB)\  <  |ff|(-D+1)2/  -  i)  M  XB 

Xi,XA,Xj 

=>  H{Xi,XA,Xj)  -  H(Xi,  XA,  Xj  I  XB)  <  -|A’|(D+1)2/  (f  -  1)  (log  /  (f  -  l))  ^  e 
^  (H(Xi  |  XA,  Xj)  +  H(XA,  Xj))  -  ( H(Xi  \  XA,  Xj,XB)  +  H{XA,  Xj  \  XB))  <  e 
=>  H(Xi  |  Aa,  Xj)  -  H(Xt  |  XA, Xj,XB)  <  e, 

where  the  first  implication  follows  from  marginalizing  irrelevant  variables  and  the  second 
implication  follows  from  (1).  Using  this  we  have  that, 

H(Xi  |  XA,  Xj,Xi)  >  H(Xi  \  XA,  Xj,XhXB) 

XA,Xj,XB 

=  B(Xj  |  XA,  Xj,XB)  since  Xi  _LL  Xi 
>  H{Xt\XA,Xj)-e 

Using  a  similar  argument,  we  also  have, 

H(Xi  |  XA,XhXu)  >  H(Xi  |  XA,Xt)  -e 

Combining  the  two  inequalities,  and  using  the  fact  that  under  the  given  conditions  e  < 
we  get 

H(Xi  |  XA,  Xj)  <  H{Xi  \  XA,  Xu)-^. 


Lemma  4  Consider  a  graphical  model  G  in  which  each  node  takes  values  in  X .  Let  the 
number  of  samples  be 

n  >  215e”4| X\^D+2^  ((£>  +  2)  log  2| X\  +  2  log 

Let $\hat{P}$ and $\hat{H}$ denote the empirical probability and empirical entropy as defined in Section
2.3.

Then  V  i  G  G,  with  probability  greater  than  1  —  we  have  that  VAC  N(i),  u  ^  N{i) 

H(Xi  |  XA,  Xu)  -  H(Xi  \  XA,  Xu)  <  €- 

O 
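Proof We use the fact that given sufficient samples, the empirical measure is close to the
true measure uniformly in probability. Specifically, given any subset $A \subset V$ of nodes and
any fixed $x_A \in \mathcal{X}^{|A|}$, we have by Azuma’s inequality after $n$ samples,

$$P\left[\, \left| \hat{P}(x_A) - P(x_A) \right| > \gamma \,\right] < 2\exp(-2\gamma^2 n) < \frac{2\delta}{p^2 (2|\mathcal{X}|)^{(D+2)}}$$

where $\gamma = 2^{-8} \epsilon^2 |\mathcal{X}|^{-2(D+2)}$. Let $V$ be the set of all vertices. Now, by union bound over
every $A \subset N(i)$, $u \in V$ and $x_i, x_A, x_u$ (at most $2^D$ choices of $A$, $p$ choices of $u$ and
$|\mathcal{X}|^{D+2}$ assignments), we have

$$P\left[\, \left| \hat{P}(x_i, x_A, x_u) - P(x_i, x_A, x_u) \right| > \gamma \,\right] < \frac{\delta}{p}.$$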


(1)  then  implies 


H(Xi  |  XA,Xu)-H{Xi  |  XA,XU) 


< 


P' 


giving  us  the  required  result.  ■ 

Using  Lemmas  3  and  4,  we  have  the  following  :  V  i  G  G,  such  that  Assumptions  1  and  2  are 
satisfied,  with  probability  greater  than  1  —  we  have  that  VAC  N(i),  u  ^  X(i),  3  j  E 
N(i)\A  such  that 

H(Xi  |  XA,  Xj)  <  H(Xi  \  XA,  Xu)  -  |  (7) 

and  VUG,  such  that  Assumptions  1  and  2  are  satisfied,  VA  C  N(i),  j  G  AT(«)  \  A,  we 
have  that 

H(Xi  |  Aa,  X,)  <  H(Xi  |  XA)  -  |  (8) 


Proof [Theorem 2] The proof is based on mathematical induction. The induction claim
is as follows: just before entering an iteration of the WHILE loop, $\hat{N}(i) \subset N(i)$. Clearly
this is true at the start of the WHILE loop since $\hat{N}(i) = \emptyset$. Suppose it is true just
after  entering  the  /c^1  iteration.  If  N (i)  =  N(i)  then  clearly  Vj  G  V  \  N(i),  H(Xi  \ 
Xftf-yXj)  =  H(Xi  |  Xj^^).  Since  with  probability  greater  than  1  —  |  we  have  that 


HiXilX^X^-HiXi 


N(i ) 

also  have  that 


XN(i)  ’  X3 


<  |  and 


H(Xt  |  X^)  -  H(Xt  |  X 


N(i)> 


<  t,  we 


H(Xt 


XN(iyXi 


-  H{Xi  |  X 


N(i)J 


<  j.  So  control  exits  the  loop  with¬ 


out  changing  N(i).  On  the  other  hand,  if  3j  G  N(i)  \  N(i)  then  from  (8)  we  have  that 
H(Xi  |  XfiX)  —  H(Xi  |  Xfr^yXj)  >  |.  So,  a  node  is  chosen  to  be  added  to  N(i)  and 
control does not exit the loop. Now suppose for contradiction that a node $u \notin N(i)$ is added
to $\hat{N}(i)$. Then we have that $\hat{H}(X_i \mid X_{\hat{N}(i)}, X_u) < \hat{H}(X_i \mid X_{\hat{N}(i)}, X_j)$. But this contradicts
(7). Thus, a neighbor $j \in N(i) \setminus \hat{N}(i)$ is picked in the iteration to be added to $\hat{N}(i)$, proving
that the neighborhood of $i$ is recovered exactly with probability greater than $1 - \frac{\delta}{p}$. Using
the union bound, it is easy to see that the neighborhood of each node (i.e., the graph structure)
is recovered exactly with probability greater than $1 - \delta$. ■
Remark 5 The proof for Theorem 2 can also be used to provide node-wise guarantees, i.e.,
for  every  node  satisfying  Assumptions  1  and  2,  if  the  number  of  samples  is  sufficiently  large 
(in  terms  of  its  degree,  and  the  length  of  the  smallest  cycle  it  is  part  of),  its  neighborhood 
will  be  recovered  exactly  with  high  probability. 

Remark 6 Any decreasing correlation-decay function $f$ suffices for Theorem 2. However,
the  faster  the  correlation  decay,  the  smaller  the  girth  in  the  sufficient  condition  for  Theorem 
2  needs  to  be. 

And finally we have a corollary for the computational complexity of GreedyAlgorithm($\epsilon$).

Corollary 1 The expected run time of Algorithm 1 is $O\left(\delta n p^4 + (1 - \delta) D^2 n p^2\right)$. Further,
if $\delta$ is chosen to be $O(p^{-2})$, the sample complexity $n$ is $O(\log p)$ and the expected run time
of Algorithm 1 is $O(D^2 p^2 \log p)$.

Proof For the second part, note that with probability greater than $1 - \delta$, the algorithm
recovers the correct graph structure exactly. In this case, the number of iterations of the
while loop is bounded by $D$ for each node. The time taken to compute any conditional
entropy is bounded by $O(nD)$. Since each iteration evaluates at most $p$ candidate nodes and
there are $p$ nodes, the total run time is $O(D^2 n p^2)$. Using the previous
worst case bound on the running time, we obtain the result. ■
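
The second claim of Corollary 1 can be checked by direct substitution. The short derivation below is a sketch that assumes the sample complexity of Theorem 2 has the form $n = \xi \epsilon^{-4} \log\frac{p}{\delta}$, takes $\delta = p^{-2}$, and treats $\xi$ and $\epsilon$ as constants:

```latex
\begin{align*}
n &= \xi\,\epsilon^{-4}\log\frac{p}{\delta}
   = \xi\,\epsilon^{-4}\log p^{3}
   = 3\,\xi\,\epsilon^{-4}\log p = O(\log p), \\
\delta\, n\, p^{4} &= p^{-2}\cdot O(\log p)\cdot p^{4} = O(p^{2}\log p), \\
(1-\delta)\, D^{2}\, n\, p^{2} &= O(D^{2} p^{2}\log p), \\
O\!\left(\delta n p^{4} + (1-\delta) D^{2} n p^{2}\right) &= O(D^{2} p^{2}\log p).
\end{align*}
```

The choice $\delta = p^{-2}$ is what makes the rare worst-case term $\delta n p^4$ no larger than the typical-case term.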


5.  Guarantees  for  the  Recovery  of  an  Ising  Graphical  Model 

In  this  section,  we  show  how  Theorem  2  can  be  used  to  efficiently  learn  Ising  graphical 
models  satisfying  certain  conditions.  The  zero  field  Ising  model  is  a  pairwise,  symmetric, 
binary  graphical  model  which  is  widely  used  in  statistical  physics  to  model  the  alignment 
of  magnetic  spins  in  a  magnetic  field  (Brush,  1967).  It  is  defined  as  follows: 

Definition 3 A set of random variables $\{X_v \mid v \in V\}$ are said to be distributed according
to a zero-field Ising model if

1. $X_v \in \{-1, 1\}$ $\forall\, v \in V$ and

2. $P(x_V) = \frac{1}{Z} \prod_{i,j \in V} \exp(\theta_{ij} x_i x_j)$

where $Z$ is a normalizing constant. The graphical model of such a set of random variables
is given by $G(V, E)$ where $E = \{\{i, j\} \mid \theta_{ij} \neq 0\}$.

It is easy to verify that this satisfies the local Markov property. Another very useful property
of zero-field Ising models is that they are symmetric with respect to $-1$ and $1$. Formally, if
$P$ is the probability distribution function over a set of zero-field Ising distributed random
variables, then $P(x_V) = P(-x_V)$.
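
The symmetry property can be verified numerically on a small example. The sketch below enumerates all configurations of a four-node zero-field Ising model by brute force; the function name `ising_distribution` and the edge parameters are illustrative assumptions, and each edge is counted once in the exponent.

```python
import itertools
import math

def ising_distribution(p, theta):
    """Exact zero-field Ising distribution on p nodes by brute-force enumeration.

    theta maps an edge (i, j) to its parameter theta_ij; each edge is counted once.
    """
    weights = {}
    for x in itertools.product((-1, 1), repeat=p):
        exponent = sum(t * x[i] * x[j] for (i, j), t in theta.items())
        weights[x] = math.exp(exponent)
    Z = sum(weights.values())  # normalizing constant
    return {x: w / Z for x, w in weights.items()}

theta = {(0, 1): 0.5, (1, 2): -0.3, (2, 3): 0.8}  # illustrative values
P = ising_distribution(4, theta)

# Check P(x_V) = P(-x_V) for every configuration.
symmetric = all(math.isclose(P[x], P[tuple(-v for v in x)]) for x in P)
print(symmetric)
```

Brute-force enumeration costs $2^p$ evaluations, so this is only a sanity check for very small graphs.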

The  main  contribution  of  this  section  is  in  the  form  of  the  following  theorem,  which 
translates  the  sufficient  conditions  from  Section  4  to  equivalent  conditions  for  an  Ising 
model. 
Theorem 7 Consider a zero-field Ising model on a graph $G$ with maximum degree $D$.
Let  the  edge  parameters  9ij  be  bounded  in  the  absolute  value  by  0  <  fi  <  \9ij\  <  Let 

e  =  2-10  sinh2(2/3).  If  the  girth  of  the  graph  satisfies  g  >  {D2  log2  —  log  (sinh2/3)}  then 

with samples $n = \xi \epsilon^{-4} \log\frac{p}{\delta}$ (where $\xi$ is a constant independent of $\epsilon$, $\delta$, $p$), GreedyAlgorithm($\epsilon$)
outputs the exact structure of $G$ with probability greater than $1 - \delta$.

The proof of this theorem consists of showing that an Ising graphical model satisfies
Assumptions 1 and 2 if the graph has large girth and the parameters on the edges satisfy
certain conditions. It also uses the fact that the girth $g_f$ of $G_f$ is at least $2g$. In Section 5.1,
we show that under certain conditions, an Ising model has an almost exponential correlation
decay. Then in Section 5.2, we use the correlation decay of Ising models to show that under
some further conditions, they also satisfy Assumption 1 for non-degeneracy. Combining the
two, we get the above sufficient conditions for GreedyAlgorithm($\epsilon$) to learn the structure of
an Ising graphical model with high probability.

5.1  Correlation  Decay  in  Ising  Models 

We  will  start  by  proving  the  validity  of  Assumption  2  in  the  form  of  the  following  proposition. 

Proposition  4  Consider  a  zero-field  Ising  model  on  a  graph  G  with  maximum  degree  D 
and  girth  g.  Let  the  edge  parameters  9ij  be  bounded  in  the  absolute  value  by  \9ij\  < 

Then,  for  any  node  i,  its  neighbors  Nl(i),  its  2-hop  neighbors  N2(i)  and  a  set  of  nodes  A, 
we  have 

P(xi,xN i(j),aW2(i)  I  xa)  ~  P(xi,xN i(j),aW2(i))|  <  cexp  ^-^|^min  ( d(i,A ),  |  -  l) 

V  Xi,xN i(j),  a;jv2(i)  and  xa  (where  c  is  a  constant  independent  of  i  and  A). 

The  outline  of  the  proof  of  Proposition  4  is  as  follows.  First,  we  show  that  if  a  subset  of 
nodes  is  conditioned  on  a  Markov  blanket  (i.e. ,  on  another  subset  of  nodes  which  separates 
them  from  the  remaining  graph),  then  their  potentials  remain  the  same.  For  this  we  have 
the  following  lemma. 

Lemma 8 Consider a graphical model $G(V, E)$ and the corresponding factorizable probability
distribution function $P$. Let $A$, $B$ and $C$ be a partition of $V$ and let $B$ separate $A$ and $C$ in
$G$. Let $\tilde{G}(A \cup B, \tilde{E})$ be the induced subgraph of $G$ on $A \cup B$, with the same edge potentials
as $G$ on all its edges, and $\tilde{P}$ be the corresponding probability distribution function. Then, we
have that $\tilde{P}(x_D \mid x_B) = P(x_D \mid x_B)$ $\forall\, x_D, x_B$, where $D \subset A$.

Now, for any node $i$, the induced subgraph on all nodes which are at distance less than
$\frac{g}{2} - 1$ is a tree. Thus we can concentrate on proving correlation decay for a tree Ising model.
We do this through the following steps:

1.  Without  loss  of  generality,  the  tree  Ising  model  can  be  assumed  to  have  all  positive 
edge  parameters 

2.  The  worst  case  configuration  for  the  conditional  probability  of  the  root  node  is  when 
all  the  leaf  nodes  are  set  to  the  same  value  and  all  the  edge  parameters  are  set  to  the 
maximum  possible  value 
3. For this scenario, correlation decays exponentially

The following three lemmas encode these three steps. For proofs, refer to the Appendix.

Lemma 9 Consider a tree Ising graphical model $T$. Let the corresponding probability
distribution be $P$. Replace all the edge parameters on this graphical model by their absolute
values. Let the corresponding probability distribution after this change be $\tilde{P}$. Then, there
exists a set of bijections
$\{M_v : \{-1, 1\} \to \{-1, 1\} \mid v \in V \setminus \{r\}\}$, where $V$ is the set of vertices and $r$ is the root node,
such that $\forall\, x_r, x_{V \setminus r}$ we have that $P(x_r, x_{V \setminus r}) = \tilde{P}(x_r, M_v(x_v), v \in V \setminus r)$.

Lemma 10 For a tree Ising graphical model $T$ with root $r$ and set of leaves $L$, we have

$$(x_r = 1, x_L = 1) \in \arg\max_{x_r, x_L} \left| P(x_r \mid x_L) - P(x_r) \right|.$$


And  finally  we  have  the  following  lemma. 

Lemma  11  In  a  tree  Ising  model,  suppose  \9ij\  <  7  <  where  D  is  the  maximum 
degree  of  the  graph.  Then  we  have  exponential  correlation  decay  between  the  root  node  r, 
its  neighbors  Arl(r),  its  2-hop  neighbors  N2(r )  and  the  set  of  leaves  L  i.e., 

\P(xr,xNi(r),xN2{r)  |  xL)  -  P(xr,xNi(r),xN 2(r))|  <  cexp(-i^d(r,L)) 

where  c  is  a  constant  independent  of  the  nodes  considered. 

5.2  Non-degeneracy  in  Ising  Models  with  Correlation  Decay 

Now, using the results from the previous section, we turn our attention to the question of
non-degeneracy. In particular, we have the following lemma, which says that if an Ising
graphical model has almost exponential correlation decay and its edge parameters satisfy
certain conditions, then it also satisfies Assumption 1. For the proof, refer to the Appendix.

Lemma  12  Consider  an  Ising  graphical  model  with  edge  parameters  9ij  bounded  in  the 
absolute  value  by  0  <  /3  <  \9ij\  <  7,  max  degree  D,  and  having  correlation  decay  as  follows 

\P(xi,x Ni(i),xN2^)  -  P(xi,xN i^,xN2^\xB)\  <  cexp  ^-amin  ^ d(i,B ), 

V  i,  B,Xi,xNi^,xN2(ty  If  the  girth  g  >  2  +  ||(2 D  +  11)  log  2  +  logc  +  log  (l  +  2De2'r)  + 

27 (D  +  3)  —  log|sinh2/3|  j,  then  this  graphical  model  satisfies  Assumption  1  with  e  = 
2-7e-67B  sinh2  (20). 

Finally,  the  proof  of  Theorem  7  follows  directly  by  combining  Theorem  2,  Proposition 
4 and Lemma 12. For complete details, refer to the Appendix.

6. Simulations


In this section, we present the results of numerical experiments evaluating the performance
of our algorithm. There are two important points to be noted here. The first is that to
satisfy the conditions so that our theoretical guarantees are applicable, the graph should
have a large girth. However, to demonstrate the fact that our algorithm is practical, we
evaluate our algorithm on graphs with much smaller girth than what is required for our
theoretical guarantees to hold. The second is that even when we satisfy the conditions for
our theoretical guarantees to be applicable, we are confronted with the question of choosing
$\epsilon$, which is an input to our algorithm. The nice behavior of our algorithm with respect to
$\epsilon$ provides a partial solution to this problem by allowing us to choose a typical $\epsilon$ for the
experiments. However, this also motivates the question of how to choose the value of $\epsilon$
experimentally, which will be interesting to look at in future work.

In the first experiment, we consider an Ising model on a binary tree of depth 5 with a
few additional edges between the leaves. The graph is shown in Fig. 3(a). As remarked
earlier, this graph does not satisfy the conditions (on girth) for our theoretical guarantees to
be applicable. However, our algorithm seems to perform very well in learning this graphical
model. This is not surprising because the graph has a structure favourable to our algorithm
(i.e., large girth and moderate edge parameters, though they do not meet the conditions
for our guarantees to hold). Fig. 3(b) presents the plots of probability of success versus
number of samples of our algorithm for various values of $\epsilon$. Here, success is defined as
exact recovery of the graph structure. There seems to exist a threshold value of $\epsilon$, call it
$\epsilon_{\mathrm{thresh}}$, such that if $\epsilon > \epsilon_{\mathrm{thresh}}$ then the probability of success is very small, and if
$\epsilon < \epsilon_{\mathrm{thresh}}$, the probability of success goes to 1 as the number of samples increases. This
would suggest that the graph under consideration in fact satisfies Assumption 1 with
$\epsilon_{\mathrm{thresh}}$. Fig. 4 presents the results of our
algorithm (using a typical value of $\epsilon$), comparing it to the algorithm in (Ravikumar et al.,
2010), which we will henceforth refer to as RWL.

In  the  second  experiment,  we  evaluate  our  algorithm  on  grids  of  various  sizes.  Fig.  5 
compares  the  sample  complexity  and  computational  complexity  of  our  algorithm  to  RWL. 
From  the  figure,  it  is  clear  that  our  algorithm  has  higher  sample  complexity  but  lower 
computational  complexity  compared  to  RWL. 

Finally, we present an application of our algorithm to modeling the senator interaction graph
using the Senate voting records, following (Banerjee et al., 2008). A Yea vote is treated as
a 1, whereas a Nay vote or an absentee vote is treated as −1. To avoid bias, we only consider
senators who have voted on a fraction of at least 0.75 of all the bills during the years 2009
and 2010. The output graph is presented in Fig. 6.
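
The preprocessing described above can be sketched as follows. This is a hypothetical illustration only: the function name `encode_votes`, the input layout (a dictionary from senator to a list of votes, with `None` marking an absence) and the toy records are assumptions, not the format of the actual voting data.

```python
def encode_votes(records, min_participation=0.75):
    """Map Yea -> 1 and Nay/absent -> -1, keeping only sufficiently active senators.

    records: dict mapping senator name to a list with one entry per bill,
             each entry being "Yea", "Nay", or None for an absence.
    """
    kept = {}
    for senator, votes in records.items():
        cast = sum(v is not None for v in votes)      # number of bills voted on
        if cast / len(votes) >= min_participation:    # participation filter
            kept[senator] = [1 if v == "Yea" else -1 for v in votes]
    return kept

# Toy records (made up for illustration).
records = {
    "A": ["Yea", "Nay", "Yea", None],
    "B": ["Yea", None, None, "Nay"],
}
encoded = encode_votes(records)
print(encoded)
```

Each retained senator then corresponds to one node, and each bill to one sample of the $\pm 1$ vector handed to the structure learning algorithm.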

7.  Discussion 

We developed a simple greedy algorithm for Markov structure learning. The algorithm is
simple to implement and has low computational complexity. We then showed that under
some non-degeneracy, correlation decay, maximum degree and girth assumptions on the
MRF, our algorithm recovers the correct graph structure with $O\left(\epsilon^{-4} \log\frac{p}{\delta}\right)$ samples. We
then specialized our conditions to prove a self-contained result for the most popular discrete
graphical model, the Ising model.
Figure 3: (a) The graph chosen for our experiments, a binary tree with a few additional edges.
(b) Results of our algorithm for various values of $\epsilon$. The edge parameters ($\theta_{ij}$) are
all chosen to be equal to 0.5. Success is defined as exact recovery of the structure.
The probability of success on the y-axis is calculated by averaging over 100 runs.
For a large value of $\epsilon$, the probability of success of our algorithm is equal to 0.
However, for smaller values of $\epsilon$, the probability of success goes to 1 as the number
of samples increases.


The  success  of  our  algorithm  can  be  further  improved  by  post-processing  via  pruning.  In 
particular,  as  mentioned,  the  neighborhood  of  a  node  as  estimated  by  our  algorithm  always 
includes  the  true  neighborhood  -  but  it  may  also  include  spurious  nodes.  The  latter  can  be 
then  identified  by  checking  each  node  of  the  estimated  neighborhood,  to  see  if  it  actually 
provides a reduction in conditional entropy over and above all the other nodes. Analysis
of the improvement achieved by such a procedure is more challenging, but it is plausible
that doing so will yield an algorithm that can handle much larger degrees and smaller
girths.
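
A pruning pass of this kind can be sketched as below. This is a minimal illustration under assumptions, not an analyzed procedure: the function names `cond_entropy` and `prune` and the tolerance `tol` are hypothetical, and the toy data set is constructed so that its empirical distribution is exactly a three-node Markov chain.

```python
import math
from collections import Counter

def cond_entropy(samples, i, cond):
    """Empirical H(X_i | X_cond) in nats, computed from a list of sample tuples."""
    n = len(samples)
    joint = Counter((tuple(s[c] for c in cond), s[i]) for s in samples)
    marg = Counter(tuple(s[c] for c in cond) for s in samples)
    return -sum(cnt / n * math.log(cnt / marg[key])
                for (key, _), cnt in joint.items())

def prune(samples, i, nbrs, tol):
    """Drop each node whose removal raises H(X_i | remaining nodes) by less than tol."""
    kept = list(nbrs)
    for j in list(nbrs):
        rest = [k for k in kept if k != j]
        gain = cond_entropy(samples, i, rest) - cond_entropy(samples, i, kept)
        if gain < tol:          # j adds no information beyond the other nodes
            kept = rest
    return kept

# Toy data whose empirical distribution is exactly the chain X0 - X1 - X2
# with flip probability 0.1 on each edge (counts 81 : 9 : 9 : 1).
samples = []
for x0 in (-1, 1):
    for x1 in (-1, 1):
        for x2 in (-1, 1):
            weight = (9 if x1 == x0 else 1) * (9 if x2 == x1 else 1)
            samples += [(x0, x1, x2)] * weight

pruned = prune(samples, 0, [1, 2], tol=1e-6)
print(pruned)
```

With finite samples the tolerance has to absorb estimation error, which is exactly where the sample complexity of conditioning on a large super-neighborhood enters.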
Figure 4: (a) The edge parameters are all chosen to be equal to 0.5. (b) The edge parameters
are chosen uniformly at random from {−0.5, 0.5}.
GA refers to our algorithm. The probability of success on the y-axis is calculated
by averaging over 100 runs. In both cases, the sample complexity of our
algorithm is slightly higher than that of RWL. However, our algorithm is more
general (i.e., not specialized for an Ising model) and has lower computational
complexity than RWL.
Appendix

We will first prove the lemmas required for proving Proposition 4.

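Proof [Lemma 9] The proof is by construction. For each node $v \in V$, let $M_v(x_v) = \eta_v x_v$.
For the root node, let $\eta_r = 1$. For any other node $v$, let $u$ be the parent of $v$ in the rooted
tree with root $r$. Define $\eta_v = \frac{\theta_{uv}}{|\theta_{uv}|} \eta_u$. Let $\Phi$ and $\tilde{\Phi}$ be the potential functions corresponding
to $P$ and $\tilde{P}$ respectively. Then,

$$\begin{aligned}
\Phi(x_V) &= \prod_{uv \in T} \exp(\theta_{uv} x_u x_v) \\
&= \prod_{uv \in T} \exp\left( |\theta_{uv}| \frac{\eta_v}{\eta_u} x_u x_v \right) \\
&= \prod_{uv \in T} \exp\left( |\theta_{uv}| \eta_u \eta_v x_u x_v \right) \\
&= \prod_{uv \in T} \exp\left( |\theta_{uv}| M_u(x_u) M_v(x_v) \right) \\
&= \tilde{\Phi}(x_r, M_v(x_v), v \in V \setminus r).
\end{aligned}$$

Since the potential functions are preserved by the bijections, so are the probabilities. ■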
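
The construction can be checked numerically on a small tree. The sketch below is illustrative only: the function name `tree_ising`, the five-node tree and its edge parameters are assumptions, and the sign-propagation rule $\eta_v = \mathrm{sign}(\theta_{uv}) \eta_u$ is used for the bijections.

```python
import itertools
import math

def tree_ising(edges, p):
    """Exact zero-field Ising distribution; edges is a list of (parent, child, theta)."""
    w = {x: math.exp(sum(t * x[u] * x[v] for u, v, t in edges))
         for x in itertools.product((-1, 1), repeat=p)}
    Z = sum(w.values())
    return {x: val / Z for x, val in w.items()}

# Root is node 0; edges are listed parent-first, top-down (illustrative values).
edges = [(0, 1, -0.4), (0, 2, 0.7), (1, 3, -0.2), (1, 4, 0.5)]
p = 5

eta = [1] * p                                   # eta_r = 1 for the root
for u, v, t in edges:
    eta[v] = (1 if t > 0 else -1) * eta[u]      # eta_v = sign(theta_uv) * eta_u

P = tree_ising(edges, p)
P_abs = tree_ising([(u, v, abs(t)) for u, v, t in edges], p)

# P(x) should equal the absolute-value model evaluated at (eta_v * x_v)_v.
matches = all(
    math.isclose(P[x], P_abs[tuple(eta[v] * x[v] for v in range(p))])
    for x in P
)
print(matches)
```

The edge list must be ordered so that every parent is processed before its children; otherwise the sign of a child would be computed from an unset parent sign.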
We will now prove the following lemma, which will help us in proving Lemma 10.

Lemma 13 Consider a tree Ising graphical model $T$ with root $r$, set of leaves $L$ and all
positive edge parameters. Let $P$ be its probability distribution. Then, the quantity $P(X_r =
1 \mid X_L = x_L)$ is monotonically increasing in $x_l$, $\forall\, l \in L$. Moreover, $P(X_r = 1 \mid X_L = 1)$ is
monotonically increasing in $\theta_{ij}$ $\forall\, \{i, j\} \in T$.

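Proof For simplicity of notation, we define $f(x_L) = P(X_r = 1 \mid X_L = x_L)$. Let us prove
the above statement by induction on the depth of the tree. For a tree of depth 1 and any
fixed leaf $j \in L$, we have that

$$f(x_L) = \frac{\prod_{l \in L} \exp(\theta_{rl} x_l)}{\prod_{l \in L} \exp(\theta_{rl} x_l) + \prod_{l \in L} \exp(-\theta_{rl} x_l)}
= \frac{\prod_{l \neq j} \exp(\theta_{rl} x_l)}{\prod_{l \neq j} \exp(\theta_{rl} x_l) + \exp(-2\theta_{rj} x_j) \prod_{l \neq j} \exp(-\theta_{rl} x_l)}.$$

Since $\theta_{rj} > 0$, $f(x_L)$ increases when $x_j$ is changed from $-1$ to $1$.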

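Now, suppose the statement is true for all trees of depth up to $k$. Consider a tree of
depth $k + 1$, with root $r$. Let $N(r)$ be the set of children of $r$. For every $c \in N(r)$, let $T_c$
be the subtree rooted at $c$ with the same edge parameters as in $T$ and $L_c$ be the leaves of
$T_c$. Let $P_c$ be the probability measure corresponding to $T_c$ and $f_c(x_{L_c}) = P_c(x_c = 1 \mid x_{L_c})$.
Then, the conditional probability of the root node can be written as

$$f(x_L) = \frac{\prod_{c \in N(r)} \left( \exp(\theta_{rc}) f_c(x_{L_c}) + \exp(-\theta_{rc}) (1 - f_c(x_{L_c})) \right)}{B} \qquad (9)$$

where

$$B = \prod_{c \in N(r)} \left( \exp(\theta_{rc}) f_c(x_{L_c}) + \exp(-\theta_{rc}) (1 - f_c(x_{L_c})) \right) + \prod_{c \in N(r)} \left( \exp(-\theta_{rc}) f_c(x_{L_c}) + \exp(\theta_{rc}) (1 - f_c(x_{L_c})) \right).$$

For any fixed child $\tilde{c} \in N(r)$, (9) can now be manipulated to obtain (10).

$$f(x_L) = \frac{K_1}{K_1 + K_2 \dfrac{g_{\tilde{c}}(x_{L_{\tilde{c}}}) + \exp(2\theta_{r\tilde{c}})}{g_{\tilde{c}}(x_{L_{\tilde{c}}}) \exp(2\theta_{r\tilde{c}}) + 1}} \qquad (10)$$

where $g_{\tilde{c}}(x_{L_{\tilde{c}}}) = \frac{f_{\tilde{c}}(x_{L_{\tilde{c}}})}{1 - f_{\tilde{c}}(x_{L_{\tilde{c}}})}$, and $K_1$ and $K_2 > 0$ are independent of $x_{L_{\tilde{c}}}$ and $\theta_{r\tilde{c}}$. Since $K_1, K_2 > 0$
and $\theta_{r\tilde{c}} > 0$, $f(x_L)$ increases if $f_{\tilde{c}}(x_{L_{\tilde{c}}})$ increases. So, for any leaf node, if its value changes
from $-1$ to $1$, the corresponding $f_{\tilde{c}}(x_{L_{\tilde{c}}})$ increases and hence $f(x_L)$ increases, proving the
induction claim.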
Using the same induction argument as above and noting that $f(x_L = 1) > \frac{1}{2}$, it can be
seen that $f(x_L = 1)$ is monotonically increasing in $\theta_{ij}$ $\forall\, \{i, j\} \in T$. ■

Proof [Lemma 10] We know that $P(x_r) = \frac{1}{2}$ for $x_r = \pm 1$. Clearly any $x_L$ that
maximizes $|P(x_r \mid x_L) - P(x_r)|$ should either minimize or maximize $P(x_r \mid x_L)$. Note also that
there is a one-to-one correspondence between such configurations (i.e., for every maximizing
configuration, there exists a minimizing configuration such that both of them maximize
$|P(x_r \mid x_L) - P(x_r)|$). From Lemma 13, we know that $x_L = 1$ maximizes $P(x_r = 1 \mid x_L)$,
and by symmetry this should be the same as $P(x_r = -1 \mid x_L = -1)$ and equal to
$\max_{x_L} P(x_r = -1 \mid x_L)$. So, we can conclude that $|P(x_r \mid x_L) - P(x_r)|$ is maximized by
$(x_r = 1, x_L = 1)$. ■


Lemma  14  Consider  a  tree  Ising  model  T  with  root  node  r,  set  of  leaves  L  and  maximum 
degree  D.  Let  P  he  its  probability  measure.  Suppose  the  absolute  values  of  the  edge  param¬ 
eters  are  bounded  by  \ 0ij\  <  V  {i,j}  G  T.  Then,  we  have  that  \ P{xr  \  xl)  —  P(xr) \  < 
exp(— ^pd(r,  L))  \/xr,XL- 

Proof  Using  Lemmas  9,  10  and  13,  we  can  assume  without  loss  of  generality  that  the  pa¬ 
rameters  Ojj  on  all  the  edges  are  positive  and  equal  to  (which  is  the  maximum  possible 
value),  consider  a  complete  D-ary  tree  and  concentrate  on  \P(Xr  =  1  |  Xl  =  1)  —  P{Xr  =  1)| 
For  simplicity  of  notation,  let  9  =  For  a  tree  of  depth  d,  let  a(d)  =  P(Xr  =  1  |  Xl  =  1). 
We  have  that 

$$a(d+1) = \frac{\left( \exp(\theta) a(d) + \exp(-\theta) (1 - a(d)) \right)^D}{\left( \exp(\theta) a(d) + \exp(-\theta) (1 - a(d)) \right)^D + \left( \exp(-\theta) a(d) + \exp(\theta) (1 - a(d)) \right)^D}$$

Using  some  algebraic  manipulations  and  substituting  the  value  of  9 ,  we  obtain 


a(d  +  1) 


1 

2 


a(d) 


1 

2 


and  the  result  follows.  ■ 
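
The recursion for $a(d)$ is easy to iterate numerically, which gives a quick feel for the decay. The sketch below is an illustration under assumptions: the function name `root_bias`, the starting value $a(0) = 1$ (a depth-0 tree whose root is itself a leaf fixed to 1) and the values of $D$ and $\theta$ are choices made here, and the contraction factor $D \tanh(\theta)$ used in the check is an elementary bound, not the constant appearing in the lemma.

```python
import math

def root_bias(depth, D, theta):
    """Iterate the recursion for a(d) and return [a(0), ..., a(depth)] with a(0) = 1."""
    a = [1.0]
    for _ in range(depth):
        up = (math.exp(theta) * a[-1] + math.exp(-theta) * (1 - a[-1])) ** D
        down = (math.exp(-theta) * a[-1] + math.exp(theta) * (1 - a[-1])) ** D
        a.append(up / (up + down))
    return a

D, theta = 3, 0.2                       # illustrative values with D * tanh(theta) < 1
gaps = [abs(a - 0.5) for a in root_bias(10, D, theta)]
rate = D * math.tanh(theta)

# Each step shrinks |a(d) - 1/2| by at least the factor D * tanh(theta).
contracting = all(g1 <= rate * g0 + 1e-12 for g0, g1 in zip(gaps, gaps[1:]))
print(contracting, gaps[-1])
```

Writing $m(d) = 2a(d) - 1$, the recursion becomes $m(d+1) = \tanh\left(D \tanh^{-1}(m(d) \tanh\theta)\right)$, which is at most $D \tanh(\theta)\, m(d)$; this is why small edge parameters relative to the degree give geometric decay in the depth.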

We  need  the  following  lemma  to  prove  Lemma  11. 

Lemma  15  Consider  a  tree  Ising  model  T ,  with  root  node  r,  set  of  leaves  L  and  maximum 
degree  D.  Let  P  be  its  probability  measure.  Suppose  the  absolute  values  of  the  edge  param¬ 
eters  are  bounded  by  \9ij\  <  V  {i,j}  G  T.  Then,  Vc  such  that  c  is  a  child  of  r,  we  have 
that  | P(xc  |  xt,xl)  —  P(xc  |  xr)\  <  4exp(—  lTf^Ld(r,  L))  \/xr,  Xj,XL ■ 

Proof Using Lemma 9 we can assume without loss of generality that the parameters $\theta_{ij}$
on all the edges are positive. $(x_c, x_r)$ can take values $(\pm 1, \pm 1)$. For each of those values,
the value of $x_L$ that maximizes $|P(x_c \mid x_r, x_L) - P(x_c \mid x_r)|$ either maximizes or minimizes
$P(x_c \mid x_r, x_L)$. Noting from (a slight extension to) Lemma 13 that $P(x_c \mid x_r, x_L)$ is monotonic
in $x_L$, it suffices to consider the eight possibilities
$|P(X_c = \pm 1 \mid X_r = \pm 1, X_L = \pm 1) - P(X_c = \pm 1 \mid X_r = \pm 1)|$.
We show how to calculate the above value for $x_c = 1$, $x_r = 1$, $x_L = 1$.
Interested readers can check that the conclusions below apply to all the other cases as
well. Using Lemma 13, we can assume that the parameters $\theta_{ij}$ on all the edges except
the  edge  {r,  c}  are  equal  to  y  and  consider  a  complete  D-ary  tree.  Let  9  =  9rc.  We 


Let  d  be  the  depth  of  the  tree  and 

ex.p(6)a(d—  1) 


know  that  P(XC  =  1  \  Xr  =  1)  =  exp(^p{_e) 

b(d)  4  P{Xc  =  1\  Xr  =  1,XL  =  1).  We  have  b(d )  =  where 

a(d)  is  as  defined  in  Lemma  14.  Using  some  algebraic  manipulations,  it  can  be  shown  that 
b(d )  —  exP(fl)+cxp(-e)  1  <  2  | a(d  —  1)  —  ||.  Using  Lemma  14  finishes  the  proof.  ■ 
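The last inequality of this proof can be sanity-checked numerically. The sketch below is illustrative only: it assumes $b(d) = \exp(\theta)a(d-1) / \left(\exp(\theta)a(d-1) + \exp(-\theta)(1 - a(d-1))\right)$, the helper names are hypothetical, and the grid of $\theta$ and $a(d-1)$ values is arbitrary. It checks the non-strict form of the bound, since both sides vanish at $a(d-1) = \frac{1}{2}$.

```python
import math

def child_marginal(a_prev, theta):
    """b(d) = P(X_c = 1 | X_r = 1, X_L = 1), given a(d-1) for the subtree rooted at c."""
    num = math.exp(theta) * a_prev
    return num / (num + math.exp(-theta) * (1.0 - a_prev))

def no_leaf_marginal(theta):
    """P(X_c = 1 | X_r = 1) when the leaves are not conditioned on."""
    return math.exp(theta) / (math.exp(theta) + math.exp(-theta))

def worst_slack(thetas, a_values):
    """Smallest value of 2|a - 1/2| - |b - p| over the grid; nonnegative if the bound holds."""
    return min(
        2.0 * abs(a - 0.5) - abs(child_marginal(a, t) - no_leaf_marginal(t))
        for t in thetas
        for a in a_values
    )

if __name__ == "__main__":
    thetas = [0.05 * k for k in range(1, 21)]           # theta in (0, 1]
    a_values = [0.5 + 0.005 * k for k in range(101)]    # a(d-1) in [1/2, 1]
    print("minimum slack over grid:", worst_slack(thetas, a_values))
```

A nonnegative minimum slack means the bound holds at every grid point, so the decay of $a(d-1) - \frac{1}{2}$ from Lemma 14 carries over to $b(d)$.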


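Proof [Lemma 11] Using Lemma 15, we have

$$\begin{aligned}
&\left| P(x_r, x_{N^1(r)}, x_{N^2(r)} \mid x_L) - P(x_r, x_{N^1(r)}, x_{N^2(r)}) \right| \\
&\quad = \Big| P(x_r \mid x_L) \prod_{j \in N^1(r)} P(x_j \mid x_r, x_L) \prod_{k \in N^2(r)} P(x_k \mid x_j, x_L) - P(x_r) \prod_{j \in N^1(r)} P(x_j \mid x_r) \prod_{k \in N^2(r)} P(x_k \mid x_j) \Big| \\
&\quad < 2^{D^2+3} \exp\left(-\frac{\log 2}{[\ldots]}\,(d(r,L) - 1)\right) \\
&\quad = c \exp\left(-\frac{\log 2}{[\ldots]}\, d(r,L)\right),
\end{aligned}$$

proving the result.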


Proof [Proposition 4] Let $I = \{i\} \cup N^1(i) \cup N^2(i)$. Let $B$ be a set that separates $I$ and $A$ such that $d(I, B) = \min\left(d(i, A), \frac{g}{2} - 1\right)$. Let $J$ be the component of nodes containing $I$ when the graph is separated by $B$. We know that the induced subgraph on $J \cup B$ is a tree. Applying Lemma 11 on this tree and using Lemma 8, we obtain $|P(x_I \mid x_B) - P(x_I \mid x_B')| < 2c\exp\left(-\frac{\log 2}{[\ldots]}\, d(I, B)\right)$ $\forall\, x_I, x_B, x_B'$. Since $P(x_I)$ is a weighted average of $P(x_I \mid x_B)$ for various $x_B$, we have

$$|P(x_I \mid x_B) - P(x_I)| < 2c\exp\left(-\frac{\log 2}{[\ldots]}\, d(I, B)\right) \quad \forall\, x_I, x_B.$$

The result then follows since $P(x_I \mid x_A)$ is a weighted average of $P(x_I \mid x_B)$. ■

Proof [Lemma 12] Let the graphical model be denoted by $G(V, E)$, let $\Phi(x_i, x_j) = \exp(\theta_{ij} x_i x_j)$ denote the potential on edge $\{i, j\}$ when $X_i = x_i$ and $X_j = x_j$, and let $\Phi(x_A)$ denote the potential due to all edges with both vertices in $A$ when $X_A = x_A$, $\forall\, A \subset V$. In the following, we assume that the girth of the graph is $g > 4$. Consider a node $i$ and a subset of its neighbors $j_1, \ldots, j_k, z$ and a node $w$ which is a neighbor of $z$. We know that the pairwise potentials satisfy $\exp(-\gamma) < \Phi(x_i, x_j) < \exp(\gamma)$. Let $\tilde{E} = E \setminus \{\{i, j_1\}, \ldots, \{i, j_k\}, \{i, z\}, \{z, w\}\}$ and consider the graph $\tilde{G}(V, \tilde{E})$ with the same potentials on all edges as in $G$. Let $A = \{i, j_1, \ldots, j_k, z, w\}$ and choose any other set $B \subset V$. Let $P$ and $\tilde{P}$ be the probability mass functions corresponding to $G$ and $\tilde{G}$ respectively. Similarly let $d(i,j)$ and $\tilde{d}(i,j)$
be the distance between $i$ and $j$ in $G$ and $\tilde{G}$ respectively. Suppose further that $d(i, B) = d$. Then, $\tilde{d}(i, B) \ge \tilde{d}(A, B) = d$. Note that,


$$\tilde{P}(x_A, x_B) = \frac{1}{Z}\,\frac{P(x_A, x_B)}{\Phi(x_A)} \quad \forall\, x_A, x_B, \tag{11}$$

where $Z$ is an appropriate normalizing constant. Note that $\frac{1}{Z}\sum_{x_A} \frac{P(x_A)}{\Phi(x_A)} = 1$. It follows from this that $\exp(-\gamma) < \frac{1}{Z} < \exp(\gamma)$. Using (11) and the hypothesis that the Ising model has almost exponential correlation decay, we obtain the following inequalities after some algebraic manipulations,

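$$|\tilde{P}(x_A, x_B) - \tilde{P}(x_A)\tilde{P}(x_B)| < c\,2^{D+3}\exp(4\gamma)\exp\left(-\alpha\min\left(d, \tfrac{g-2}{2}\right)\right) P(x_B) \tag{12}$$

$$\tilde{P}(x_B) > \exp(-2\gamma)\left(1 - 2^{D+2}c\exp\left(-\alpha\min\left(d, \tfrac{g-2}{2}\right)\right)\right) P(x_B) \tag{13}$$

$\forall\, x_A, x_B$. Combining (12) and (13), we obtain

$$|\tilde{P}(x_A, x_B) - \tilde{P}(x_A)\tilde{P}(x_B)| < c\,2^{D+3}\exp(6\gamma)\,\frac{\exp\left(-\alpha\min\left(d, \frac{g-2}{2}\right)\right)}{1 - 2^{D+2}c\exp\left(-\alpha\min\left(d, \frac{g-2}{2}\right)\right)}\,\tilde{P}(x_B)$$

and subsequently by marginalizing, we obtain

$$|\tilde{P}(x_i, x_B) - \tilde{P}(x_i)\tilde{P}(x_B)| < c\,2^{2D+4}\exp(6\gamma)\,\frac{\exp\left(-\alpha\min\left(d, \frac{g-2}{2}\right)\right)}{1 - 2^{D+2}c\exp\left(-\alpha\min\left(d, \frac{g-2}{2}\right)\right)}\,\tilde{P}(x_B).$$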


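Let $A' = A \setminus \{i\}$. Since $d(i, A') = 2$, we have that $\tilde{d}(i, A') \ge g - 2$. So, $\exists\, B \subset V$ separating $i$ and $A'$ in $\tilde{G}$ such that $\tilde{d}(i, B) \ge \frac{g-2}{2}$. Then, $\forall\, x_i, x_{A'}$,

$$\begin{aligned}
|\tilde{P}(x_i \mid x_{A'}) - \tilde{P}(x_i)|
&= \Big| \sum_{x_B} \left(\tilde{P}(x_i \mid x_B) - \tilde{P}(x_i)\right) \tilde{P}(x_B \mid x_{A'}) \Big| \\
&< c\,2^{2D+4}\exp(6\gamma)\,\frac{\exp\left(-\alpha\frac{g-2}{2}\right)}{1 - 2^{D+2}c\exp\left(-\alpha\frac{g-2}{2}\right)} \\
&< 2^{-(D+6)}\exp(-2\gamma(D+1))\,|\sinh(2\beta)| = \epsilon,
\end{aligned} \tag{14}$$

where the last inequality follows from the lower bound on girth $g$ in the hypothesis.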

Now consider the graph $\hat{G}(V, \hat{E})$ where $\hat{E} = \{\{i, j_1\}, \ldots, \{i, j_k\}, \{i, z\}, \{z, w\}\}$. Let the potentials on the edges in $\hat{G}$ be the same as those in $G$ and denote the corresponding probability mass function by $\hat{P}$. Clearly, we have the following relation between $P$, $\tilde{P}$ and $\hat{P}$:

$$P(x_A) = \frac{1}{Z}\,\hat{P}(x_A)\,\tilde{P}(x_A) \quad \forall\, x_A,$$

where $Z$ is an appropriate normalizing constant. Using (14) and the symmetry of the Ising model (i.e., $\tilde{P}(x_i) = \frac{1}{2}$ for $x_i = \pm 1$), we obtain

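$$\begin{aligned}
P(x_i \mid x_{A'}) &= \frac{P(x_i, x_{A'})}{P(x_{A'})} \\
&= \frac{\frac{1}{Z}\hat{P}(x_i, x_{A'})\,\tilde{P}(x_i, x_{A'})}{\frac{1}{Z}\sum_{x_i'} \hat{P}(x_i', x_{A'})\,\tilde{P}(x_i', x_{A'})} \\
&= \frac{\hat{P}(x_i \mid x_{A'})\,\tilde{P}(x_i \mid x_{A'})}{\sum_{x_i'} \hat{P}(x_i' \mid x_{A'})\,\tilde{P}(x_i' \mid x_{A'})} \\
&< \frac{1 + 2\epsilon}{1 - 2\epsilon}\,\hat{P}(x_i \mid x_{A'})
\end{aligned}$$

after some algebraic manipulations. Similarly, we also have

$$P(x_i \mid x_{A'}) > \frac{1 - 2\epsilon}{1 + 2\epsilon}\,\hat{P}(x_i \mid x_{A'}),$$

which implies

$$|P(x_i \mid x_{A'}) - \hat{P}(x_i \mid x_{A'})| < 8\epsilon.$$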


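Finally, letting $A^* = A' \setminus \{z\}$, we have,

$$\begin{aligned}
H(X_i \mid X_{A^*}) - H(X_i \mid X_{A'})
&= \sum_{x_i, x_{A'}} P(x_i, x_{A'}) \log\frac{P(x_i \mid x_{A'})}{P(x_i \mid x_{A^*})} \\
&= \sum_{x_{A'}} P(x_{A'})\, D\left(P(X_i \mid x_{A'}) \,\|\, P(X_i \mid x_{A^*})\right) \\
&\ge \frac{1}{2\log 2} \sum_{x_{A'}} P(x_{A'}) \sum_{x_i} |P(x_i \mid x_{A'}) - P(x_i \mid x_{A^*})|^2 \\
&= \frac{1}{2\log 2} \sum_{x_{A^*}} P(x_{A^*}) \sum_{x_z} P(x_z \mid x_{A^*}) \sum_{x_i} |P(x_i \mid x_{A'}) - P(x_i \mid x_{A^*})|^2 \\
&\ge [\ldots] \sum_{x_{A^*}, x_i} P(x_{A^*}) \min_{x_z} P(x_z \mid x_{A^*})\, |P(x_i \mid x_{A^*}, x_z = -1) - P(x_i \mid x_{A^*}, x_z = 1)|^2 \\
&\ge [\ldots] \sum_{x_{A^*}, x_i} P(x_{A^*})\, \frac{\exp(-\gamma D)}{\exp(\gamma D) + \exp(-\gamma D)} \left( \max\left(0,\, |\hat{P}(x_i \mid x_{A^*}, x_z = -1) - \hat{P}(x_i \mid x_{A^*}, x_z = 1)| - 16\epsilon \right) \right)^2 \\
&\ge [\ldots] \sum_{x_{A^*}} P(x_{A^*}) \exp(-2\gamma D) \left( |\sinh(2\beta)| \exp(-2\gamma D) - 16\epsilon \right)^2 \\
&\ge \exp(-6\gamma D) \sinh^2(2\beta).
\end{aligned}$$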

So, we have shown that under the given conditions, an Ising model satisfies (3) with $\epsilon = \exp(-6\gamma D)\sinh^2(2\beta)$. It is straightforward to note that the above proof can also be used to show that the Ising model also satisfies (4) with the same $\epsilon$, completing the proof of the lemma.
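The entropy argument in the proof of Lemma 12 rests on two generic facts: the drop in conditional entropy from conditioning on one more variable equals an averaged KL divergence, and KL divergence measured in bits is at least $\frac{1}{2\log 2}$ times the squared $L_1$ distance (Pinsker's inequality). The sketch below checks both on random distributions over three binary variables. It is illustrative only: `X`, `Y`, `Z` stand in for $X_i$, $X_{A^*}$, $X_z$, and the random joint distribution is an assumption, not an Ising model from the text.

```python
import itertools
import math
import random

def entropy_bits(p):
    """Shannon entropy (base 2) of a probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; q is assumed strictly positive."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def random_joint(seed):
    """A strictly positive random joint pmf over (x, y, z) in {0,1}^3."""
    rng = random.Random(seed)
    w = {s: rng.uniform(0.1, 1.0) for s in itertools.product((0, 1), repeat=3)}
    total = sum(w.values())
    return {s: v / total for s, v in w.items()}

def entropy_drop_and_bounds(joint):
    """Return (H(X|Y) - H(X|Y,Z), averaged KL divergence, Pinsker lower bound)."""
    p_yz = {(y, z): sum(joint[(x, y, z)] for x in (0, 1)) for y in (0, 1) for z in (0, 1)}
    p_y = {y: p_yz[(y, 0)] + p_yz[(y, 1)] for y in (0, 1)}
    p_xy = {(x, y): joint[(x, y, 0)] + joint[(x, y, 1)] for x in (0, 1) for y in (0, 1)}
    h_x_given_y = sum(
        p_y[y] * entropy_bits([p_xy[(x, y)] / p_y[y] for x in (0, 1)]) for y in (0, 1)
    )
    h_x_given_yz = avg_kl = pinsker = 0.0
    for (y, z), pyz in p_yz.items():
        cond_yz = [joint[(x, y, z)] / pyz for x in (0, 1)]   # P(X | y, z)
        cond_y = [p_xy[(x, y)] / p_y[y] for x in (0, 1)]     # P(X | y)
        h_x_given_yz += pyz * entropy_bits(cond_yz)
        avg_kl += pyz * kl_bits(cond_yz, cond_y)
        l1 = sum(abs(a - b) for a, b in zip(cond_yz, cond_y))
        pinsker += pyz * l1 ** 2 / (2 * math.log(2))
    return h_x_given_y - h_x_given_yz, avg_kl, pinsker

if __name__ == "__main__":
    drop, avg_kl, pinsker = entropy_drop_and_bounds(random_joint(seed=0))
    print(f"entropy drop={drop:.6f}  averaged KL={avg_kl:.6f}  Pinsker bound={pinsker:.6f}")
```

The first two printed numbers agree up to rounding and the third is never larger. This is the mechanism that lets a lower bound on the change in the conditional distribution translate into a lower bound on the entropy reduction.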
Proof [Theorem 7] The theorem follows directly from Theorem 2, Proposition 4 and Lemma 12. ■